Course Project Notebook: Analysis of Crunchbase
Our main goal is to study trends and patterns among large companies as well as emerging startups in tech. The startup culture has led to enormous growth in the tech industry and generated some of the most innovative products and services in history.
We collected data from Crunchbase, a website that lists businesses and startups in the tech industry and provides information on their categories, investments, funding rounds, acquisitions and IPOs, among many other features.
We aim to cluster VC and investor groups by category, and to predict funding potential for upcoming startups and investment opportunities for investors. We will also use various models for feature selection to predict these characteristics. Overall, our objective is to gain both a startup-side and a VC-side perspective on funding from Crunchbase data.
We plan to use k-means and LSI for clustering investors and try a variety of regression approaches such as linear regression, GLMs or GAMs to determine funding and investment potential.
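As a rough illustration of the clustering step, here is a minimal sketch of k-means (Lloyd's algorithm) written with NumPy. The investor profiles are hypothetical numbers invented for the example, not values from the Crunchbase data, and the function name `kmeans` is our own; the project itself may well have used a library implementation instead.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members;
        # a cluster that ends up empty keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical investor profiles: share of deals in [web, biotech, mobile].
investors = np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.2],
])
labels, centroids = kmeans(investors, k=2)
print(labels)  # the two web-heavy investors share one label, the two biotech-heavy the other
```

The cluster numbering itself is arbitrary; only the grouping of investors is meaningful.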
Ansuya Ahluwalia (ansuya@cs.ucla.edu) -- exploratory data analysis, graph mining
Ashwini Bhatkhande (ash@cs.ucla.edu) -- machine learning, production of final notebook
Raghav Mehrish (rmehrish@ucla.edu) -- data extraction, machine learning
Shivin Kapur (shivinkapur@ucla.edu) -- data modeling, production of final notebook
- Companies
- Acquisitions
- Rounds
- Investments
- Exploratory Data Analysis
- Python : Pandas, numpy, matplotlib.pyplot
- R : data.table, plyr, ggplot2, reshape2
- Data Modeling
- R : rmisc, MASS, plyr, splines, data.table
- Predictive Modeling
- Python : pandas, numpy, sklearn, pylab
- R : caret, kernlab, plyr, e1071, MASS, nnet, knn3
- Graph Mining
Exploratory Data Analysis
/Users/Work/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
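The warning above means pandas inferred more than one type for a column while reading the file in chunks. One common way to avoid it, sketched below, is to read the ambiguous column as strings and convert it explicitly afterwards. The CSV content and the column name `funding_total_usd` are assumptions for illustration; the notebook does not show which column triggered the warning.

```python
import io
import pandas as pd

# Hypothetical CSV in which one column mixes numbers and a text placeholder.
csv_text = "name,funding_total_usd\nacme,1000\nglobex,-\ninitech,2500\n"

# Reading the ambiguous column as strings avoids mixed-type inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"funding_total_usd": str})

# Convert explicitly afterwards; entries that cannot be parsed become NaN.
df["funding_total_usd"] = pd.to_numeric(df["funding_total_usd"], errors="coerce")
print(df.dtypes)
```

Passing `low_memory=False` to `pd.read_csv`, as the warning suggests, is the other option; it silences the warning but leaves the column with mixed values.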
      company_category_code company_country_code funding_round_type funded_year total_amount
11635                   web                  USA            venture        2010    244010490
11636                   web                  USA            venture        2011    416615077
11637                   web                  USA            venture        2012    191888230
11638                   web                  USA            venture        2013    121948356
11639                   web                  USA            venture        2014    170653603
11640                   web                  ZAF           series-a        2012       600000
[1] "name" "permalink" "homepage_url"
[4] "category_code" "funding_total_usd" "status"
[7] "country_code" "state_code" "region"
[10] "city" "funding_rounds" "founded_at"
[13] "founded_year" "investor_permalink" "investor_name"
[16] "investor_category_code" "investor_country_code" "investor_state_code"
[19] "investor_region" "investor_city" "funding_round_type.y"
[22] "funded_at.y" "funded_month.y" "funded_quarter.y"
[25] "funded_year.y" "raised_amount_usd.y" "quarters"
[1] "category_code" "status" "country_code"
[4] "funding_rounds" "founded_year" "investor_category_code"
[7] "investor_country_code" "funding_round_type.y" "funded_year.y"
[10] "quarters" "labelNum"
[1] "category_codeadvertising" "category_codebiotech"
[3] "category_codeecommerce" "category_codeenterprise"
[5] "category_codemobile" "category_codesoftware"
[7] "category_codeweb" "statusoperating"
[9] "country_codeUSA" "funding_rounds"
[11] "founded_year" "investor_country_codeUSA"
[13] "funding_round_type.yseries.a" "funding_round_type.yseries.b"
[15] "funding_round_type.yseries.c." "funding_round_type.yventure"
[17] "funded_year.y" "quartersQ2"
[19] "quartersQ3" "quartersQ4"
[1] "x"
[1] "x"
[1] 164606 20
[1] 54865 20
[1] 164606 1
[1] 54865 1
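The four shapes printed above (164606 and 54865 rows, with 20 feature columns and 1 label column) are consistent with holding out about 25% of the rows for testing. The exact partitioning method is not shown in the notebook, so the following is only a sketch of a simple shuffled split; the helper name `train_test_split_indices` is hypothetical.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(1000)
print(len(train_idx), len(test_idx))  # → 750 250
```

A stratified split, which preserves the class proportions in both parts, would give slightly different counts and may explain why the reported sizes are not an exact 75/25 division.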
['category_codeadvertising' 'category_codebiotech' 'category_codeecommerce'
'category_codeenterprise' 'category_codemobile' 'category_codesoftware'
'category_codeweb' 'statusoperating' 'country_codeUSA' 'funding_rounds'
'founded_year' 'investor_country_codeUSA' 'funding_round_type.yseries-a'
'funding_round_type.yseries-b' 'funding_round_type.yseries-c+'
'funding_round_type.yventure' 'funded_year.y' 'quartersQ2' 'quartersQ3'
'quartersQ4']
Random Forest
Pred [1 5 5 ..., 2 2 1]
Mean : 0.909742
Feature Importances [ 0.01507738 0.01305174 0.01320477 0.01932507 0.01867471 0.02129487
0.0117261 0.03326322 0.02709368 0.19284599 0.22825277 0.02048339
0.03373769 0.03094636 0.04350878 0.0150039 0.16409708 0.03255682
0.03360408 0.03225159]
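The importance vector is easier to read when paired with the feature names. The sketch below copies the 20 names and the 20 importances printed above and ranks them; it assumes the importances are listed in the same order as the feature names, which the output suggests but does not state.

```python
feature_names = [
    "category_codeadvertising", "category_codebiotech", "category_codeecommerce",
    "category_codeenterprise", "category_codemobile", "category_codesoftware",
    "category_codeweb", "statusoperating", "country_codeUSA", "funding_rounds",
    "founded_year", "investor_country_codeUSA", "funding_round_type.yseries-a",
    "funding_round_type.yseries-b", "funding_round_type.yseries-c+",
    "funding_round_type.yventure", "funded_year.y", "quartersQ2", "quartersQ3",
    "quartersQ4",
]
importances = [
    0.01507738, 0.01305174, 0.01320477, 0.01932507, 0.01867471, 0.02129487,
    0.0117261, 0.03326322, 0.02709368, 0.19284599, 0.22825277, 0.02048339,
    0.03373769, 0.03094636, 0.04350878, 0.0150039, 0.16409708, 0.03255682,
    0.03360408, 0.03225159,
]

# Rank features from most to least important (assumes matching order).
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name}: {score:.3f}")
# → founded_year: 0.228
# → funding_rounds: 0.193
# → funded_year.y: 0.164
```

Under that assumption, the three numeric features account for roughly 58% of the total importance, while each one-hot category indicator contributes only a few percent.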
Predict Probability [[ 0.82431818 0.17568182 0. ..., 0. 0. 0. ]
[ 0.6 0.275 0.025 ..., 0. 0. 0. ]
[ 0.59044289 0.33256022 0.07699689 ..., 0. 0. 0. ]
...,
[ 0. 0. 0. ..., 0. 0. 1. ]
[ 0. 0. 0. ..., 0. 0. 1. ]
[ 0. 0. 0. ..., 0. 0. 1. ]]
Transform [[-1.37108833 1.28185539 0.72992665]
[-1.37108833 0.03388946 -1.4315749 ]
[-1.37108833 1.28185539 1.0901769 ]
...,
[ 1.72109748 0.24188378 0.00942613]
[ 1.72109748 0.24188378 0.00942613]
[ 1.72109748 0.24188378 0.36967639]]
Neighbors: 5, Accuracy: 0.837419
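For reference, a k-nearest-neighbours prediction with 5 neighbours can be sketched as a majority vote among the closest training points. This is a plain-Python illustration on hypothetical two-feature data, not the code that produced the accuracy above; the function name `knn_predict` and the toy points are assumptions.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled features with two of the class labels.
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 8), (7, 9), (8, 8), (7, 7)]
train_y = [1, 1, 1, 1, 5, 5, 5, 5]

print(knn_predict(train_X, train_y, (1.5, 1.5)))  # → 1
print(knn_predict(train_X, train_y, (7, 8)))      # → 5
```

Because the vote depends on raw distances, features on very different scales (for example years versus 0/1 indicators) are usually standardised first.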
Mean : 0.910234
Classes [1 2 3 4 5 6 7]
['category_codeadvertising' 'category_codebiotech' 'category_codeecommerce'
'category_codeenterprise' 'category_codemobile' 'category_codesoftware'
'category_codeweb' 'statusoperating' 'country_codeUSA' 'funding_rounds'
'founded_year' 'investor_country_codeUSA' 'funding_round_type.yseries-a'
'funding_round_type.yseries-b' 'funding_round_type.yseries-c+'
'funding_round_type.yventure' 'funded_year.y' 'quartersQ2' 'quartersQ3'
'quartersQ4']
Feature Importances [ 0.01622837 0.01805052 0.01683086 0.02387598 0.02330386 0.02719037
0.02007616 0.04184556 0.03291276 0.17633547 0.17953565 0.01676038
0.02089646 0.03407675 0.05169245 0.02741202 0.143394 0.04255339
0.04517388 0.04185513]
Parameters {'splitter': 'best', 'min_density': None, 'compute_importances': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': None, 'criterion': 'gini', 'max_features': None, 'max_depth': None}
Mean : 0.542404
Mean : 0.425043
Coefficients [[ -6.74029896e-02 2.45308593e-02 -1.20782550e-01 -4.29517792e-03
-5.99620023e-02 6.16633533e-02 3.19169638e-02 6.87422514e-02
-3.94510785e-01 -5.54229981e-02 2.04825404e+00 1.26281471e-01
-2.24610020e+00 -1.84775494e+00 -2.19486223e+00 -6.29456530e-01
-8.88031994e-01 -2.35726543e-02 -5.97043480e-02 -2.21975639e-01]
[ -1.26700157e-02 2.52423338e-03 1.57914101e-02 -8.93324474e-03
-2.69554548e-02 6.94913922e-02 7.32423720e-02 9.29939822e-03
-3.35939026e-02 -6.86815592e-02 8.24484217e-01 -3.23513451e-01
-1.18995161e+00 -1.51903258e+00 -2.36856207e+00 -5.09394917e-01
-5.16752080e-01 -7.00693680e-02 -5.49226988e-02 -1.77407424e-02]
[ 2.55463677e-02 -1.34064882e-01 1.08793547e-02 8.92122227e-03
3.00940358e-02 1.88527141e-02 7.66543137e-02 7.21465486e-02
1.52263435e-01 -1.62661012e-01 4.94138193e-01 -2.04595507e-01
9.20764209e-02 -7.14374035e-01 -1.27932294e+00 -4.05893725e-02
-1.02502073e-01 -5.30080087e-02 -3.55057424e-02 -2.70420186e-02]
[ 1.05576633e-01 -8.34919369e-02 1.54682289e-03 3.40013585e-02
6.50640500e-02 9.45736640e-02 3.55419178e-02 3.83016559e-02
-2.37572727e-02 -3.82860598e-02 -8.50248509e-02 -1.10091834e-02
1.14921335e+00 6.12926381e-01 3.62936252e-01 7.56788790e-01
3.89810771e-02 3.17920698e-02 -1.78148967e-02 3.28055652e-02]
[ 8.30712109e-02 -1.04929319e-01 -3.79907045e-02 8.09280104e-02
1.30688033e-02 7.81698781e-02 -5.48353945e-02 5.01371207e-02
1.21308618e-02 2.00486789e-02 -2.22484585e-01 1.45890656e-01
8.93009082e-01 1.12428045e+00 8.60697927e-01 7.84077446e-01
-4.71173596e-02 6.75062589e-02 4.68888942e-02 -3.01676225e-02]
[ -1.56234462e-01 1.76853956e-01 -2.05927538e-02 -1.08639006e-01
-2.88818987e-02 -1.75801892e-01 -1.48770292e-01 -7.22909434e-02
2.90718043e-02 7.60130247e-02 -1.67084663e-01 1.70267681e-01
1.02718969e-01 7.21441498e-01 1.13186724e+00 5.08726018e-01
6.14231827e-02 -2.10530515e-02 1.94825679e-02 5.39670210e-02]
[ -2.54951191e-01 6.37519166e-02 1.11294565e-01 -6.34519717e-02
-2.72995877e-01 -1.63254149e-01 -9.49132191e-02 -2.27037212e-01
-1.13123322e-01 3.90856987e-01 -2.80016728e-01 2.25141408e-01
-1.04172375e+00 -4.36026802e-01 5.60266603e-02 -6.60389386e-01
4.71304348e-01 -1.36097181e-02 1.00112396e-02 3.92822679e-02]]
Intercept [-6.63405632 -4.18993133 -2.49310703 -1.9064998 -1.32079602 -1.46560506
-3.37443161]
Confidence Score [[ -3.80259573e-01 -1.09565235e-01 -5.75784442e-01 ..., -3.63514781e+00
-3.30702772e+00 -3.88546763e+00]
[ -2.89550464e+00 -1.22741370e+00 -1.17327836e+00 ..., -1.34007815e+00
-1.65208832e+00 -5.28072960e+00]
[ -8.81845417e-01 4.56090354e-03 -5.91749617e-01 ..., -3.98208617e+00
-3.31645315e+00 -2.22541171e+00]
...,
[ -7.49659823e+00 -5.49987479e+00 -3.74457074e+00 ..., -1.53283205e-01
-4.37126050e-01 -2.18507251e+00]
[ -7.49659823e+00 -5.49987479e+00 -3.74457074e+00 ..., -1.53283205e-01
-4.37126050e-01 -2.18507251e+00]
[ -7.92889863e+00 -6.94638761e+00 -4.70463307e+00 ..., -1.30738280e+00
3.25655230e-01 -7.15082556e-01]]
Predict Probability [[ 2.98871605e-01 3.47869784e-01 2.64896076e-01 ..., 1.89172074e-02
2.60042505e-02 1.48123317e-02]
[ 4.70373046e-02 2.03532619e-01 2.12179652e-01 ..., 1.86346152e-01
1.44433150e-01 4.54686487e-03]
[ 2.19531470e-01 3.75743638e-01 2.67095853e-01 ..., 1.37249491e-02
2.62504901e-02 7.30970078e-02]
...,
[ 5.01491478e-04 3.68043384e-03 2.08852485e-02 ..., 4.17490380e-01
3.54808091e-01 9.14076781e-02]
[ 5.01491478e-04 3.68043384e-03 2.08852485e-02 ..., 4.17490380e-01
3.54808091e-01 9.14076781e-02]
[ 3.02645211e-04 8.07926187e-04 7.54148487e-03 ..., 1.78975688e-01
4.88112960e-01 2.76103409e-01]]
Transform [[ 1.28185539 -0.54554616 -0.46470611 -0.59320985 -0.43459879 0.72992665]
[ 0.03388946 -0.54554616 -0.46470611 -0.59320985 2.30095885 -1.4315749 ]
[ 1.28185539 -0.54554616 -0.46470611 -0.59320985 -0.43459879 1.0901769 ]
...,
[ 0.24188378 -0.54554616 2.15188462 -0.59320985 -0.43459879 0.00942613]
[ 0.24188378 -0.54554616 2.15188462 -0.59320985 -0.43459879 0.00942613]
[ 0.24188378 -0.54554616 -0.46470611 1.68573385 -0.43459879 0.36967639]]
Mean : 0.304602
Probability [ 0.03121393 0.07793762 0.15145256 0.16656136 0.25540989 0.24954133
0.06788331]
Mean of each feature per class [[ -7.30472081e-02 -2.42751788e-01 3.09193182e-02 -7.09320325e-02
5.06484509e-02 5.53919504e-02 2.99860734e-01 2.37510273e-01
-6.84308334e-01 -6.04511515e-01 9.35049754e-01 -6.37359695e-01
-5.31658112e-01 -4.54011594e-01 -5.88330835e-01 -2.36539728e-01
4.72324574e-01 -1.12942448e-02 5.06365802e-02 -1.20487318e-01]
[ -2.78199574e-02 -1.94570885e-01 9.20310886e-02 -7.54475813e-02
3.72045676e-02 3.22735018e-02 2.63956834e-01 1.28778288e-01
-3.31293761e-01 -4.49596953e-01 6.59590982e-01 -6.88749844e-01
-3.36965564e-01 -4.25545968e-01 -5.85571330e-01 -5.88834102e-02
2.83973389e-01 -7.47910826e-02 -6.68228817e-04 2.27835183e-02]
[ 4.05268539e-02 -1.92112781e-01 5.34596619e-02 -1.40720514e-02
6.55814951e-02 -1.50955713e-02 1.94155710e-01 1.58522504e-01
-4.21008768e-02 -3.85384334e-01 5.36596632e-01 -3.60254010e-01
5.37352405e-01 -3.52821401e-01 -5.53262061e-01 2.51680211e-02
2.53899211e-01 -5.09007881e-02 -9.24619657e-04 7.40553422e-03]
[ 9.92603646e-02 -9.35198752e-02 -9.77075013e-03 1.38635760e-02
3.57736748e-02 6.39704781e-02 2.56726942e-02 4.73172755e-02
-3.49057302e-02 -1.29087995e-01 5.30828625e-02 -3.03795020e-02
5.95628451e-01 -5.54731107e-02 -3.31044463e-01 1.48092851e-01
-4.06622797e-02 1.07546812e-02 -3.02852615e-02 3.04536058e-02]
[ 5.96004767e-02 -4.04175487e-02 -5.84006036e-02 7.13052934e-02
-1.93649522e-02 7.53243578e-02 -8.70167835e-02 -3.94584307e-02
7.81069853e-02 9.28309245e-02 -2.40482771e-01 1.96454170e-01
-2.53315336e-02 3.66165075e-01 7.23397589e-03 4.85256005e-02
-2.11966425e-01 4.93325840e-02 2.14113973e-02 -4.27514769e-02]
[ -8.93619023e-02 2.47407841e-01 -3.95667905e-02 -3.38206464e-02
-2.16890736e-02 -9.02896723e-02 -1.35832916e-01 -1.20257397e-01
1.28549843e-01 2.89344912e-01 -3.48061361e-01 2.62250901e-01
-3.92673450e-01 1.25167059e-01 6.07402226e-01 -3.89432226e-02
-1.18462885e-01 -8.45860900e-04 -4.60110564e-03 2.75226688e-02]
[ -1.64187474e-01 2.35683260e-01 1.50002565e-01 -2.73416763e-02
-1.47506796e-01 -1.37303980e-01 -1.10374469e-01 -1.36306070e-01
1.08165557e-01 5.57791940e-01 -3.30370514e-01 2.58925088e-01
-4.90201103e-01 -2.17190767e-01 7.29409613e-01 -2.82568959e-01
2.23080907e-01 -4.26666678e-03 -9.79051440e-03 -2.32288769e-03]]
Variance of each feature per class [[ 0.72751922 0.27412012 1.12392991 0.78868296 1.15555181 1.12120442
2.00215964 0.62179042 1.6669198 0.65600474 0.29490454 0.97055848
0.0328407 0.02786879 0.0110952 0.50257458 0.51803117 0.98651735
1.06008543 0.83914516]
[ 0.89748042 0.42756474 1.36326415 0.77489017 1.11476144 1.07136198
1.89164216 0.80893396 1.43982385 0.75382269 0.62934614 0.93279015
0.45261571 0.10093254 0.01734941 0.88662902 0.74717909 0.90600234
0.99916664 1.02714537]
[ 1.14656108 0.43526874 1.21307489 0.95887254 1.20043673 0.96589725
1.66940571 0.76008927 1.06806287 0.71842576 0.69582424 1.0831846
1.40307042 0.2802383 0.08944294 1.04633309 0.75574013 0.93724183
0.99884899 1.00893309]
[ 1.35314349 0.73430794 0.9604316 1.04011882 1.11039866 1.13942746
1.09283388 0.93364688 1.05668087 0.83603551 0.91325897 1.01703065
1.41207333 0.90332362 0.52872947 1.25445702 1.00809856 1.01258955
0.96160302 1.03605234]
[ 1.21440487 0.88731566 0.76068652 1.20227458 0.93916204 1.16331977
0.67550996 1.05189748 0.86432101 0.93769331 0.98101408 0.84526064
0.9667387 1.48370291 1.00784489 1.08820543 1.01982177 1.05586792
1.02602904 0.94624467]
[ 0.66520557 1.61852087 0.83860679 0.90049524 0.93181078 0.78926554
0.48684468 1.14846362 0.77021665 1.04692357 0.79774726 0.77618188
0.34024693 1.19550631 1.29465797 0.92579528 1.10920275 0.99899346
0.99427954 1.03266265]
[ 0.37259117 1.59197236 1.58339652 0.9197332 0.51772327 0.67308291
0.58583149 1.16608977 0.80885756 1.10951053 1.13314928 0.77988136
0.1285785 0.5863825 1.26485305 0.39277329 0.94742489 0.99493282
0.9877837 0.99716739]]
Predict Probability [[ 9.78933322e-001 1.96678630e-002 1.37410903e-003 ...,
1.95204378e-006 1.13551826e-006 6.06241859e-009]
[ 3.11526057e-002 8.30995038e-001 9.63635979e-002 ...,
6.27341074e-003 3.95799808e-003 7.07936703e-006]
[ 8.65832248e-001 1.31709138e-001 2.28837644e-003 ...,
9.23904745e-008 2.34057676e-007 1.62951558e-004]
...,
[ 4.14479493e-054 1.76284897e-015 1.56516839e-006 ...,
3.15076043e-001 6.26708426e-001 4.87674155e-002]
[ 4.14479493e-054 1.76284897e-015 1.56516839e-006 ...,
3.15076043e-001 6.26708426e-001 4.87674155e-002]
[ 2.37605421e-103 1.21535194e-066 8.70440996e-015 ...,
1.40258323e-002 2.73640432e-001 7.12103237e-001]]
Pred [1 5 5 ..., 2 2 1]
Mean : 0.909469
Feature Importances [ 0.01574237 0.0152938 0.01492231 0.0184253 0.01671296 0.0227888
0.01556403 0.02954133 0.02546658 0.1940718 0.22034108 0.01850768
0.03806563 0.02549743 0.04026472 0.01668293 0.17064961 0.0338506
0.03442094 0.03319011]
Predict Probability [[ 0.74777778 0.25222222 0. ..., 0. 0. 0. ]
[ 0.8 0. 0.2 ..., 0. 0. 0. ]
[ 0.60727273 0.33687166 0.05585561 ..., 0. 0. 0. ]
...,
[ 0. 0. 0. ..., 0. 0. 1. ]
[ 0. 0. 0. ..., 0. 0. 1. ]
[ 0. 0. 0. ..., 0. 0. 1. ]]
Transform [[-1.37108833 1.28185539 0.72992665]
[-1.37108833 0.03388946 -1.4315749 ]
[-1.37108833 1.28185539 1.0901769 ]
...,
[ 1.72109748 0.24188378 0.00942613]
[ 1.72109748 0.24188378 0.00942613]
[ 1.72109748 0.24188378 0.36967639]]
Pred [1 5 5 ..., 2 2 1]
Mean : 0.910289
Feature Importances [ 0.00976162 0.01121367 0.01210828 0.01586263 0.01547273 0.01839148
0.00946322 0.02490826 0.02157679 0.17348857 0.25357233 0.02079125
0.03249232 0.03153995 0.04896464 0.01553601 0.19668538 0.02954983
0.03039174 0.0282293 ]
Pred [1 5 5 ..., 2 2 1]
Mean : 0.909833
Feature Importances [ 0.01114193 0.01207747 0.01244962 0.01568419 0.01520974 0.01847939
0.01000801 0.02658796 0.01330737 0.18679507 0.25358345 0.01780327
0.02927668 0.0283059 0.05228376 0.02188926 0.18886312 0.0290693
0.03076994 0.02641458]
Pred [1 5 5 ..., 2 2 2]
Mean : 0.475987
Feature Importances [ 0.01622699 0.01101957 0.00386317 0.00910628 0.01038051 0.02331455
0.01389899 0.03001674 0.0235695 0.08343904 0.18106806 0.08129819
0.12179502 0.09331017 0.0992527 0.06905604 0.11523619 0.00512814
0.00045272 0.00856743]
Comparison between classifiers

| Classifier | Accuracy |
|---|---|
| kNN | 0.837419 |
| Trees | 0.910234 |
| SVM | 0.542404 |
| Logistic Regression | 0.425043 |
| Naive Bayes | 0.304602 |
| Random Forest | 0.909469 |
| Adaboost | 0.910289 |
| Extra Trees | 0.909833 |
| Gradient Boosting | 0.475987 |
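To make the comparison easier to scan, the accuracies from the table can be ranked with a few lines of Python. The numbers are copied from the table; nothing is recomputed here.

```python
# Accuracies as reported in the comparison table.
accuracy = {
    "kNN": 0.837419,
    "Trees": 0.910234,
    "SVM": 0.542404,
    "Logistic Regression": 0.425043,
    "Naive Bayes": 0.304602,
    "Random Forest": 0.909469,
    "Adaboost": 0.910289,
    "Extra Trees": 0.909833,
    "Gradient Boosting": 0.475987,
}

# Sort classifiers from highest to lowest accuracy.
ranked = sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<20} {acc:.6f}")

# Gap between the best and the fourth-best classifier.
top_four_spread = ranked[0][1] - ranked[3][1]
```

The four tree-based models (Adaboost, Trees, Extra Trees, Random Forest) lie within about 0.001 of each other, so the ordering among them should not be over-interpreted; kNN follows at about 0.84, and the remaining models are well below.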
Analyzing kNN, Trees, Random Forest
Adjacency, Two-mode Incidence, Similarity and Overlap Network Graphs for investors and the companies they invested in.
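The graphs themselves are not reproduced here, so the following is only a sketch of how the underlying matrices relate to each other. A two-mode incidence matrix records which investor funded which company; multiplying it by its transpose gives the overlap (number of shared companies) between investors; thresholding the overlap gives a one-mode adjacency matrix. The edge list is hypothetical.

```python
# Hypothetical investor -> company edge list.
investments = [
    ("investor_a", "company_x"),
    ("investor_a", "company_y"),
    ("investor_b", "company_y"),
    ("investor_b", "company_z"),
    ("investor_c", "company_z"),
]
investors = sorted({inv for inv, _ in investments})
companies = sorted({comp for _, comp in investments})

# Two-mode incidence matrix B: rows are investors, columns are companies.
B = [[0] * len(companies) for _ in investors]
for inv, comp in investments:
    B[investors.index(inv)][companies.index(comp)] = 1

# Overlap matrix (B times B transposed): entry (i, j) counts the companies
# that investors i and j both funded; the diagonal is each investor's deal count.
n = len(investors)
overlap = [
    [sum(B[i][c] * B[j][c] for c in range(len(companies))) for j in range(n)]
    for i in range(n)
]

# One-mode adjacency: two different investors are linked if they share a company.
adjacency = [
    [1 if i != j and overlap[i][j] > 0 else 0 for j in range(n)]
    for i in range(n)
]
print(adjacency)  # → [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

A similarity graph could be derived from the same matrices, for instance by dividing each overlap count by the size of the union of the two investors' portfolios (Jaccard similarity); which measure the original graphs used is not stated.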
Adjacency and Two-mode Incidence Graphs for acquirers and the companies acquired.
Adjacency, Two-mode Incidence and Overlap Network Graphs for acquirer regions and regions of companies acquired.